Optimizing hierarchical tree dissection parameters using historic epidemiologic data as ‘ground truth’

Hierarchical clustering of pathogen genotypes is widely used to complement epidemiologic investigations of outbreaks. Investigators must dissect trees to obtain genetic partitions that provide epidemiologists with meaningful information. Statistical approaches to tree dissection often require a user-defined parameter to predict the optimal partition number and augmenting this parameter can drastically impact resultant partition memberships. Here, we demonstrate how to optimize a given tree dissection parameter to maximize accuracy irrespective of the tree dissection method used. We hierarchically clustered 1,873 genotypes of the foodborne pathogen Cyclospora spp., including 587 possessing links to historic outbreaks. We dissected the resulting tree using a statistical method requiring users to select the value of a ‘stringency parameter’ (s), with a recommended value of 95% to 99.5%. We dissected this hierarchical tree across s-values from 94% to 99.5% (at increments of 0.25%), to identify a value that maximized partitioning accuracy, defined as the degree to which genetic partitions conform to known epidemiologic groupings. We show that s-values of 96.5% and 96.75% yield the highest accuracy (> 99.9%) when clustering Cyclospora sp. isolates with known epidemiologic linkages. In practice, the optimized s-value will generate robust genetic partitions comprising isolates likely derived from a common food source, even when the epidemiologic grouping is not known prior to genetic clustering. While the s-value is specific to the tree dissection method used here, the optimization approach described could be applied to any parameter/method used to dissect hierarchical trees.


Introduction
Hierarchical clustering is widely used in the field of molecular epidemiology to detect groups of genetically related pathogen isolates. However, an important limitation of hierarchical clustering is that hierarchical clusters are nested, meaning that small clusters comprising closely related isolates exist within larger clusters that get progressively larger as genetic relationships become increasingly distant. Consequently, investigators must dissect hierarchical trees into Barratt's heuristic outperforming all other methods when evaluating how accurately genetic clusters reflect gold standard epidemiologic clusters [12], as we propose to do here. In a recent study, dissection of hierarchically clustered Cyclospora spp. MLST data to identify discrete partitions comprising closely related isolates, was performed using a statistical framework that requires selecting a value for the user-defined 'stringency' parameter [4]. In that study [4], we recommended that the stringency parameter be set to a value above 95% and below 100%, though we justified the use of the maximum recommended s-value of 99.5% to dissect a hierarchically clustered dataset of more than 1,000 Cyclospora spp. MLST genotypes [4]. Setting the stringency to 99.5% resulted in the delimitation of genetic partitions where 90.8% of epidemiologically linked isolates were also linked genetically (i.e., 90.8% sensitivity) [4]. We also advised that users should consider optimizing the stringency parameter (s) to maximize performance, though specific details on how this may be achieved were not provided [4]. Therefore, the aim of this study was to demonstrate how a given tree dissection parameter-in this case, the value of the stringency parameter-can be optimized using historic epidemiologic data to improve tree dissection accuracy. Ultimately, we show that compared to when tree dissection parameter values are empirically selected, optimization of parameters in the way described does result in genetic partitions that more accurately reflect the epidemiologic linkage of clustered genotypes.

Genotyping data
We utilized a publicly available MLST dataset for Cyclospora spp. generated by the United States (U.S.) Centers for Disease Control and Prevention (CDC), the Public Health Agency of Canada, and certain U.S. State public health departments, as part of ongoing Cyclospora spp. genotyping performed during 2018, 2019, 2020, and 2021 [8,9,[13][14][15][16][17][18]. To maximize the diversity of isolates included this analysis, we also included genotypes from persons who became infected in China and Indonesia, and from persons presenting with cyclosporiasis in the UK after returning from travel. Briefly, this dataset comprised 1,873 Cyclospora sp. genotypes. These isolates had been sequenced at eight markers as previously described [8,9,18], including six nuclear markers and two mitochondrial markers. Illumina data from these isolates were accessed under NCBI BioProject Number PRJNA578931. Each isolates' genotype had been ascertained using bioinformatic workflows previously described [8].

Epidemiologic information
Epidemiologic information for a subset of these 1,873 genotypes was collected prior to this study through Cyclosporiasis National Hypothesis Generating Questionnaires (CNHGQ) during routine US public health surveillance. Each CNHGQ included information on a casepatient's food consumption history during a two-week period before becoming ill. Using this information, 587 isolates included in this analysis had been confidently linked to an outbreak or event that occurred in the USA, for which more than one isolate was genotyped (Table 1). Genotypes possessing clear epidemiologic links represented a reference for expected (i.e., 'ground truth') clustering outcomes when assessing clustering performance (see below). Isolates that could not be linked confidently to an outbreak cluster were designated as possessing "unknown epidemiologic linkage". Isolates in this "unknown" category also included all isolates from outside the USA as CNHGQs were not collected for cyclosporiasis patients outside the USA.

Distance calculation and partition number selection
A pairwise distance matrix was calculated from these Cyclospora spp. genotypes using Barratt's heuristic definition of genetic distance as previously described [3,19,20]. This matrix was hierarchically clustered using Ward's method implemented via the agnes function in the R package 'cluster' [21]. Next, we applied Plucinski and Barratt's framework as previously described [4] to dissect the resulting hierarchical tree into a k number of discrete partitions across 23 different stringency values (s-values): those ranging from 94% to 99.5%, at intervals of 0.25%. The number of discrete partitions (k) predicted using each of these 23 s-values was recorded. We subsequently dissected the hierarchical tree into the number of partitions (k) predicted for each s-value using the cutree R function [22]. The partition memberships  10 13 Vehicle (an herb) identified.

PLOS ONE
resulting from each of these 23 tree dissection iterations was used to assess partitioning performance for each of the corresponding 23 s-values. All hierarchical trees in this manuscript were generated using ggtree in R [23].

Assessment of partitioning performance
For each of the 23 s-values tested, we classified clustering results obtained for each genotype as either a true positive (TP), false positive (FP), true negative (TN), or false negative (FN), using the definitions described below. From these classifications we calculated various performance metrics including sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy, as previously described [8]. The calculations were weighted by the ratio of genotyped isolates in each epidemiologic cluster to the total number of genotyped isolates with epidemiologic links (n = 587) so that larger epidemiologic clusters (i.e., with more genotyped isolates) would contribute more to the final values. Given that accuracy is a measure of proximity of results from the true value, we proposed that the optimal stringency setting would be the value of k that results in maximum accuracy, as determined by the equation: After identifying the stringency setting that maximized accuracy, we assessed the discriminatory power of obtained using this setting by calculating Simpson's index of diversity (D) as described elsewhere [7]. The value of D was determined by: where N is the total number of isolates (n = 1,873), S is the number of partitions (i.e., equal to k), and n j represents the number of isolates within the jth partition. D is calculated with all isolates, not just those with epidemiological linkages. Simpson's index assesses a method's ability to distinguish between unrelated strains sampled randomly from a given species [7], where values of D close to 1.0 generally indicate good discriminatory power. We therefore considered this an indicator of whether the optimal stringency value (i.e., the one that maximizes accuracy) also provides useful strain discrimination.

Classification of epidemiologically-linked isolates after clustering
To compute partitioning accuracy, each of the 587 isolates with epidemiologic links were classified as a TP, TN, FP, or FN based on whether they were correctly assigned to the same partition as their epidemiologically-linked partners or not. Previous investigations showed that most epidemiologically-linked isolates included in this analysis possess a similar genetic signature [8,9,18]. Therefore, each epidemiologic cluster would have a partition number (i.e., a genetic cluster) to which the majority of its epidemiologically-linked isolates would be assigned. For the purposes of classification, we refer to this as the 'mode' partition number for an epidemiologic cluster. True positives would comprise isolates that were correctly assigned to the mode partition number for their epidemiologic cluster. Next, if we consider a fictitious epidemiologic cluster called "Outbreak A", true negatives for the "Outbreak A" cluster would include all isolates from Outbreaks X, Y, and Z that were not assigned to the mode partition for outbreak A. False negatives would include isolates that were not assigned to the mode partition number for their epidemiologic cluster. False positives would include isolates with a particular epidemiologic linkage that were assigned to a different partition to that of their epi-linked partners, and to a partition alongside isolates with a different epidemiologic linkage. Importantly, isolates belonging to different epidemiologic clusters can share the same mode partition number (i.e., unrelated outbreaks caused by the same strain). Therefore, isolates with the same mode partition number but possess a distinct epidemiologic linkage were not classified as false positive linkages for the purposes of our analysis. The classifications were performed for each epidemiologic cluster separately and the sum of all TP, TN, FP, and FN classifications for each epidemiologic cluster was used to compute an overall value of clustering accuracy for each stringency setting used, in addition to other performance metrics.

Ethics
Ethics approval for the use of clinical specimens was reviewed by the CDC Center for Global Health Human Research Protection Office under project determination number 2018-123. The need for patient informed consent was waived because the specimens were de-linked from any personal identifiers prior to submission to CDC.

Results
Stringency values of 96.5% and 96.75% produced identical performance results ( Table 2) and were established as optimal for partitioning our hierarchically clustered dataset. All svalues � 96.5% resulted in partitions with zero false positive links, yielding a specificity and PPV of 1 (Table 2). Conversely, all values � 96.75% resulted in partitions with the fewest false negatives (n = 6), which maximized sensitivity and NPV (Table 2). Consequently, the optimal s-values of 96.5% and 96.75% meant we minimized both false positives and false negatives,  (Tables 2 and 3). The optimal stringency setting also maximized the number of partitions that isolates with epidemiologic links (n = 587) were distributed across (17 partitions, Table 2). Sub-optimal accuracy was observed at an s-value of 99.5%, which maximized the number of false negatives (n = 39) as a consequence of dividing many linked isolates across different partitions (Table 2). Conversely, an s-value of 94% (the minimum value evaluated) yielded 73 false positives, due to assignment of many unrelated isolates to the same partition. The higher number of false positives at lower stringency values is a consequence of isolates from different epidemiologic clusters being assigned to the same partition. Specifically, this Colored rows indicate multiple outbreaks caused by the same strain, as evidenced by the fact that at all values of k the isolates associated with these different outbreaks remained genetically linked. Therefore, genetic linkage of isolates from unrelated outbreaks shaded the same color was not considered a false-positive linkage in this study because the causative strains were the same based on the genotyping scheme employed. Res. = restaurant, Dist. = Distributor, Temp. = temporospatial. a Four discrete epidemiologically-defined clusters that were originally assigned to the same genetic partition at lower stringency values, yet separate into three distinct genetic partitions at higher stringency values isolates from these four distinct outbreaks were assigned to the same partition at a stringency of 94% (Fig 1). At optimal s-values, only isolates linked to the 2020 Salad Chain A (E.C.N. -22) and 2019 Distributor A (E.C.N.-16) epidemiological clusters remained in the same partition (Fig 2), supporting that these two outbreaks were caused by the same strain. In contrast isolates linked to 2018 Temporospatial Cluster A (E.C.N-07), and the 2021 FL Italian-style restaurant (E.C.N-24) cluster were divided across two distinct genetic partitions at higher stringency values (Fig 2). At stringency values above the established optima, some isolates were incorrectly separated from their epidemiologically linked partners (i.e., false negatives). For example, out of the 132 genotyped isolates linked to the pre-packaged salad mix 2020_001 cluster, all 132 belonged to a single genetic cluster at the optimal settings, while 11 isolates were split into other partitions at a stringency of 99.5% (Fig 3, S1 File). Likewise, 6 of the 94 specimens belonging to 2018 Vendor A split from their epi-linked partners at 99.5% stringency (Fig 3).
Simpson's index of diversity ranged from 0.9100 at k = 15 (s-values of 94% through 94.5%) to 0.9335 at k = 67 (s = 99.5%) ( Table 2). At the optimal stringency values, k = 30 and Simpson's index of diversity was 0.9210, which is indicative of good discriminatory power.

Discussion
We recently described a framework for dissecting hierarchically clustered genetic data that requires investigators to provide a user-defined stringency value that impacts downstream genetic partition memberships [4]. We recommended that the stringency parameter be set to a value between 95% and 100% and here we describe a process by which the selection of this parameter can be refined. While this seems like a small range of values, we demonstrate that even minor adjustments to the s-value can have a major impact on the resultant genetic partitions and subsequently, the perceived genetic linkages. This underpins the need for investigators to optimize user-defined parameters that impact the process of hierarchical tree dissection, regardless of the method employed.
Specifically, our results highlight the importance of selecting parameter values that maximize partitioning accuracy. In our investigation, all stringency values evaluated resulted in partitioning at an accuracy greater than 99%; however, at more relaxed stringency values (i.e., < 95%) greater than 70 false positives were observed, while at higher stringency values (i.e., > 98.5%) more than 30 false negatives were observed. At the optimal value established here, 0 false positives and only 6 false negatives were observed. Therefore, arbitrarily selecting a given An important characteristic of any molecular epidemiologic tool is that the intersect between epidemiologically-defined clusters and their analogous genetic partitions should be nearing 100% [7]. Optimization of tree dissection parameters in this context should include computation of accuracy using epidemiologically defined clusters as a 'ground truth' reference for expected clustering outcomes. These reference genotypes of known epidemiologic linkage should be clustered alongside isolates of unknown epidemiologic linkage that represent possible candidates for downstream epidemiologic investigation. Resulting partitions identified using optimized parameter values that containing isolates with an unknown epidemiologic linkage will subsequently have a high likelihood of being derived from a common source, and thus represent robust candidates for epidemiologic follow-up. This is because their assignment to the same partition was based on parameters optimized to an internal reference of expected genetic links.
For epidemiologic investigations of cyclosporiasis outbreaks, patients complete a CNHGQ that collects information on the foods they recall consuming days to weeks prior to falling ill. Cyclosporiasis often presents as intermittent, non-specific symptoms many days to weeks after consumption of contaminated produce, meaning that weeks may elapse between illness onset and CNHGQ interview. This delay can make it difficult for patients to recall specific meal components, potentially introducing noise to epidemiologic datasets [9]. Regardless, our experience with Cyclospora spp. has consistently demonstrated good concordance (~90% or more) between linkages identified via CNHGQ and genetic clustering. Accuracy of 100% may never be observed when assuming epidemiologic data as a true representative of 'ground truth', due to various sources of noise [9]. However, given the generally strong concordance between genotyping and these epidemiologic methods [8,9], selecting parameter values that maximize accuracy seems warranted, as this--in our experience-will usually yield partitions of unknown linkage that are more likely derived from a common source.
The discriminatory power of our dissected tree, determined by Simpson's index of diversity (D), was 0.9210 at the optimal s-values which resulted in 30 partitions. This is slightly lower than the value of D = 0.95 recommended elsewhere as an indicator of good discriminator power [7]. None of the stringency values evaluated here exceeded 0.95 (we observed a maximum D = 0.9335 at stringency = 99.5%), which is likely the result of a confluence of multiple factors. First, our dataset is heavily weighted to isolates from cyclosporiasis case patients residing in the United States (U.S.) between 2018 and 2021, which is unlikely to reflect the full genetic diversity of Cyclospora spp. (i.e., only genotypes causing U.S. infections during these periods were analyzed). Second, the current MLST-based genotyping approach captures only a portion of the~45 megabase Cyclospora spp. genome [6]; the MLST method involves sequencing 8 genetic markers each less than 1 kilobase in length each. Finally, Simpson's index of diversity is a formula where datasets with greater richness (i.e., high number of clusters) and evenness (i.e., every cluster has a similar number of isolates) will have greater D-values compared to those with less richness and/or evenness. Our dataset consists of isolates from numerous cyclosporiasis outbreaks of varying sizes (Table 1), meaning richness and evenness are constrained by outbreak history, which impacts the value of D. Novel Cyclospora spp. types are identified each year [8,9,18] and work is being done to increase the number of markers used to genotype Cyclospora spp., thus discriminatory power will likely increase in response to these updates.
The sequencing of additional/different Cyclospora spp. MLST markers would warrant a reassessment of the optimal s-value, as subsequent tree structures may be impacted by the inclusion of the additional genetic information. Likewise, a large increase in the number of outbreaks caused by novel Cyclospora spp. genetic types, may also augment the resultant tree topology and thus be an impetus for re-evaluating the optimal value of the stringency parameter. Generally, when factors impact tree structure (e.g., new markers) or when the gold standard epidemiologic references do not encompass the observed genetic diversity (e.g., outbreaks from novel types) it is highly recommended that tree dissection parameters (i.e., SNP distance thresholds, or the stringency parameter in this case) be re-optimized. Nevertheless, the presently evaluated epidemiologic clusters represent the currently observed genetic diversity fairly well (Fig 1). Our optimal s-value (i.e., 96.5 to 96.75) was determined using a set of genotypes with gold-standard epidemiologic groupings, plus approximately 1,300 isolates without epidemiologic linkages. The optimal s-value described here remains a robust choice when applied to Cyclospora spp. that include the same gold-standard genotypes and a selection of the 1,300 additional isolates used here, in addition to any clinical isolates of interest to the investigator. A suggested reference dataset is provided (S2 File).
To conclude, we describe a simple approach that has proven useful for optimizing hierarchical tree dissection parameters to facilitate subsequent epidemiologic investigations. While the present example applies specifically to optimization of the stringency parameter for a particular tree dissection framework, this same approach could easily be used to optimize parameter values that are applicable to any tree dissection approach. We anticipate that other molecular epidemiologists will find this work useful, especially in contexts where optimized parameters for tree dissection have not yet been established for certain pathogens.
Supporting information S1 File. Complete clustering results. This excel file contains a full list of calculation and results for accuracy and Simpson's D at each s-value. (XLSX) S2 File. Analysis support files. This excel file includes the list of the suggested reference isolates, as well as the haplotype sheet and distance matrix used for clustering in this manuscript. The file also includes the epidemiologic linkages for each isolate. (XLSX)